This block imports the required library and loads the COVID-19 confirmed cases dataset.
import pandas as pd
# Load the dataset
file_path = 'time_series_covid19_confirmed_US.csv'
data = pd.read_csv(file_path)
The script displays the first few rows and the column headers of the dataset.
# Display the first few rows and the columns of the dataset
data.head(), data.columns
( UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region \
0 84001001 US USA 840 1001.0 Autauga Alabama US
1 84001003 US USA 840 1003.0 Baldwin Alabama US
2 84001005 US USA 840 1005.0 Barbour Alabama US
3 84001007 US USA 840 1007.0 Bibb Alabama US
4 84001009 US USA 840 1009.0 Blount Alabama US
Lat Long_ ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 \
0 32.539527 -86.644082 ... 19732 19759 19759 19759 19759 19759
1 30.727750 -87.722071 ... 69641 69767 69767 69767 69767 69767
2 31.868263 -85.387129 ... 7451 7474 7474 7474 7474 7474
3 32.996421 -87.125115 ... 8067 8087 8087 8087 8087 8087
4 33.982109 -86.567906 ... 18616 18673 18673 18673 18673 18673
3/6/23 3/7/23 3/8/23 3/9/23
0 19759 19759 19790 19790
1 69767 69767 69860 69860
2 7474 7474 7485 7485
3 8087 8087 8091 8091
4 18673 18673 18704 18704
[5 rows x 1154 columns],
Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
'Country_Region', 'Lat', 'Long_',
...
'2/28/23', '3/1/23', '3/2/23', '3/3/23', '3/4/23', '3/5/23', '3/6/23',
'3/7/23', '3/8/23', '3/9/23'],
dtype='object', length=1154))
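The column listing above mixes identifier columns with one column per date. The sketch below shows one way the latest date could be found by parsing the labels rather than relying on column position. It assumes the date labels follow the M/D/YY pattern seen in the output. `find_date_columns` is a hypothetical helper that is not part of the original script, and the short column list is a stand-in for `data.columns`.

```python
from datetime import datetime

def find_date_columns(columns, fmt="%m/%d/%y"):
    """Return the labels that parse as dates, sorted chronologically.

    Hypothetical helper: the M/D/YY format is an assumption based on the
    labels shown in the output above.
    """
    dated = []
    for name in columns:
        try:
            dated.append((datetime.strptime(str(name), fmt), name))
        except ValueError:
            # Not a date label (e.g. 'UID', 'Lat'), so skip it
            continue
    dated.sort(key=lambda pair: pair[0])
    return [name for _, name in dated]

# Stand-in for data.columns, deliberately out of order and with a derived column at the end
columns = ["UID", "Admin2", "Province_State", "Lat", "Long_",
           "3/8/23", "3/9/23", "2/28/23", "Total_Cases_Latest"]

date_columns = find_date_columns(columns)
latest_date = date_columns[-1]
print(latest_date)
```

Selecting the date this way keeps working if a derived column is later appended to the frame, in which case the last column is no longer a date.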
The latest date is taken from the last column of the dataset, and the cumulative confirmed case count for each region as of that date is stored in a new column. Outliers are then detected with the interquartile range (IQR): any region whose case count lies more than 1.5 times the IQR below the first quartile or above the third quartile is flagged. The output previews the flagged regions and reports how many there are.
# Extracting the latest date from the dataset to calculate the total confirmed cases up to that date
latest_date = data.columns[-1] # Assumes the last column is the latest date
# The series is cumulative, so the latest date's column already holds each region's total to date
data['Total_Cases_Latest'] = data[latest_date]
# Using the Interquartile Range (IQR) to detect outliers
Q1 = data['Total_Cases_Latest'].quantile(0.25)
Q3 = data['Total_Cases_Latest'].quantile(0.75)
IQR = Q3 - Q1
# Define outliers as regions where the total cases are beyond 1.5 times the IQR from the Q1 or Q3
outlier_condition = ((data['Total_Cases_Latest'] < (Q1 - 1.5 * IQR)) |
(data['Total_Cases_Latest'] > (Q3 + 1.5 * IQR)))
outliers = data.loc[outlier_condition, ['Admin2', 'Province_State', 'Total_Cases_Latest']]
outliers.head(), outliers.shape
(       Admin2 Province_State  Total_Cases_Latest
 1     Baldwin        Alabama               69860
 36  Jefferson        Alabama              238727
 40        Lee        Alabama               47646
 44    Madison        Alabama              116086
 48     Mobile        Alabama              134986,
 (442, 3))
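To make the IQR rule concrete, here is a minimal sketch on a small made-up series. `iqr_fences` is a hypothetical helper and the sample values are illustrative, not taken from the dataset.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values only: seven similar counts and one very large one
sample = pd.Series([10, 12, 14, 16, 18, 20, 22, 500])

lower, upper = iqr_fences(sample)
flagged = sample[(sample < lower) | (sample > upper)]
print(lower, upper)
print(flagged.tolist())
```

For non-negative, right-skewed counts the lower fence often falls below zero, in which case only unusually high values can be flagged.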
Total COVID-19 cases are aggregated by state, and the mean and standard deviation of the state totals are calculated. States whose totals lie more than two standard deviations from the mean are flagged as outliers and highlighted in an interactive scatter plot.
import plotly.express as px
import numpy as np
# Assuming 'data' is already loaded and contains a column 'Province_State' for states
# and 'Total_Cases_Latest' for the latest total cases
# Aggregate total cases by state
state_cases = data.groupby('Province_State')['Total_Cases_Latest'].sum().reset_index()
# Calculate the mean and standard deviation for the cases
mean_cases = state_cases['Total_Cases_Latest'].mean()
std_cases = state_cases['Total_Cases_Latest'].std()
# Identify outliers (e.g., cases that are more than 2 standard deviations from the mean)
state_cases['Outlier'] = np.abs(state_cases['Total_Cases_Latest'] - mean_cases) > 2 * std_cases
# Create an interactive scatter plot
fig = px.scatter(state_cases, x='Province_State', y='Total_Cases_Latest',
                 color='Outlier',  # Boolean column, so Plotly assigns one discrete color per value
                 hover_name='Province_State',  # Shows state name on hover
                 hover_data={'Total_Cases_Latest': True, 'Outlier': False},  # Shows cases on hover, hides the outlier flag
                 labels={'Total_Cases_Latest': 'Total Cases', 'Province_State': 'State'},
                 title='Interactive Scatter Plot of Total COVID-19 Cases for Each State Highlighting Outliers')
# Show the plot
fig.show()
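The two-standard-deviation rule can be checked by hand on a small example. The sketch below uses made-up totals and a hypothetical helper, `flag_sigma_outliers`. The state labels and numbers are illustrative only.

```python
import pandas as pd

def flag_sigma_outliers(totals, n_sigma=2.0):
    """Flag values more than n_sigma sample standard deviations from the mean."""
    mean = totals.mean()
    std = totals.std()  # pandas uses the sample standard deviation (ddof=1)
    return (totals - mean).abs() > n_sigma * std

# Illustrative totals: eight similar values and one far larger
totals = pd.Series(
    [100, 110, 90, 105, 95, 100, 102, 98, 1000],
    index=[f"State {c}" for c in "ABCDEFGHI"],
)

flags = flag_sigma_outliers(totals)
print(totals[flags])
```

Because the mean and standard deviation are themselves pulled toward extreme values, this rule can miss outliers when several large values are present. The IQR rule used for regions is less sensitive to that effect.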
import plotly.express as px
# Sort the outliers by 'Total_Cases_Latest' in descending order and take the top 10
# (.copy() so the new columns are added to an independent frame)
outliers_sorted = outliers.sort_values(by='Total_Cases_Latest', ascending=False).head(10).copy()
# Convert 'Total_Cases_Latest' to millions
outliers_sorted['Total_Cases_Latest_Millions'] = outliers_sorted['Total_Cases_Latest'] / 1_000_000
# Build a "Admin2, Province_State" label for the y-axis
outliers_sorted['Region'] = outliers_sorted['Admin2'] + ", " + outliers_sorted['Province_State']
# Create an interactive bar chart for the top 10 outlier regions
# (label keys must match the column names for the axis titles to apply)
fig = px.bar(outliers_sorted, x='Total_Cases_Latest_Millions', y='Region',
             labels={'Total_Cases_Latest_Millions': 'Total Cases (millions)', 'Region': 'Region'},
             title='Top 10 Total COVID-19 Cases in Outlier Regions (Millions)')
# Display the plot
fig.show()
The Python script analyzes COVID-19 confirmed cases in the U.S. and identifies outliers at both the regional and state levels, using the interquartile range (IQR) for regions and a two-standard-deviation rule for states.
Key observations: regions and states whose case counts deviate markedly from the rest are flagged as outliers, which draws attention to areas with unusual totals. The scatter plot and bar chart show how cases are distributed and highlight the states and regions with exceptionally high numbers. The analysis points to areas that might need more intensive public health interventions or additional resources because of their atypical case numbers.